Main goals of the session

  1. Understand the assumptions and requirements of the statistical tests of neutrality based on polymorphism and divergence data
  2. Calculate and interpreting the results of the HKA-test
  3. Calculate and interpreting the results of the MK-test

1. Practical organization

The main goal of this practical class is to reproduce a few examples of the application of the HKA (Hudson, Kreitman and Aguadé 1987) and MK (McDonald and Kreitman 1991) methods to test for selective neutrality in the recent past of different genomic regions. Both tests are based on important predictions of the Neutral Theory of Molecular Evolution (Kimura 1983). The HKA test contrasts the levels of nucleotide polymorphism (variability within population) with the levels of nucleotide divergence (fixed substitutions between species) for different loci (at least one of these loci is expected to be evolving under neutrality). All neutral loci across the genome must have the same ratio of polymorphism to divergence. This is because both estimates of nucleotide variation are proportional to the mutation rate. The MK test is based on the same assumption but it compares nucleotide variation at different classes of sites within the same gene (i.e. synonymous -or silent- versus non-synonymous sites) rather than across genes. If all amino acid substitutions between two species are neutral, we expect the same number of non-synonymous and synonymous (or silent) changes, the latter reflecting the neutral mutation rate. In this practice, you will work with polymorphism data from two different Drosophila species, Drosophila melanogaster and Drosophila subobscura, and divergence data from other two drosophilid species, Drosophila simulans and Drosophila guanche (see Fig. 1).


Figure 1. Drosophila species used in this practical lesson and their phylogenetic relationships.


Throughout the document you will see different icons whose meaning is:

: Additional or useful information

: Practical exercise

: Hint to solve an exercise or to do a task

: Slot to answer a question

: Problem or task to be solved


2. Installing R packages

You can use either the R console in the terminal or RStudio for this practice. If you don’t have R installed, you can download the appropriate package for your system here. To install RStudio, go to this page and follow the instructions.

Before starting the exercises, you will need to install some necessary libraries for manipulating and analyzing DNA sequence data. Open the R console in the terminal (or in RStudio) and type:

 # install.packages("ape")
 devtools::install_github("pievos101/PopGenome")
 
 install.packages("remotes")
 remotes::install_github("andrewparkermorgan/sfsr")

3. Data files

  1. Data files: link and description (click on the file name to access the file) :

    • rpL32.fas: This file contains the sequence of the rpL32 gene for 18 individuals (one sequence per individual) of a natural population of D. subobscura (J) and one sequence from a closely related species, D. guanche (Dgua). All sequences of D. subobscura are from the same chromosomal arrangement (O3+4). For this practical, use the 18 sequences of D. subobscura as the polymorphism data and the sequence of D. guanche for divergence estimations.

    • Acph.fas. This file contains the sequence of the Acph gene for 42 individuals (one sequence per individual) of a natural population of D. subobscura (TB) and one sequence of a closely related species, D. guanche (Dgua). The sequences of D. subobscura are from two different chromosomal arrangements (O3+4+8 and O3+4+23). For this practical, use the 18 sequences of D. subobscura as the polymorphism data and the sequence of D. guanche for divergence estimations.

    • OS.fas. This file contains the sequence of the OS locus of 14 individuals (one sequence per individual) of an European population of D. melanogaster (M) and 2 individuals of D. simulans (S). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.

    • E9.fas. This file contains the sequence of the E9 locus of 15 individuals (one sequence per individual) of world wide sample of D. melanogaster (MEL_) and 1 individual (one sequence per individual) of D. simulans (SIM_). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.

    • E10.fas. This file contains the sequence of the E10 locus for 15 individuals (one sequence per individual) of an African population of D. melanogaster (mel-) and one sequence from a closely related species, D. simulans (Dsim). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.


3. Example 1. The HKA test

Under neutrality, the amount of polymorphism in a species should be correlated with the levels of divergence between species for all loci across the genome due to the dependence of both on the neutral mutation rate (Kimura 1983). The HKA test evaluates the fit of the observed polymorphism and divergence data to this specific prediction (Fig. 2).


Figure 2. The HKA test.


Open the R console in the terminal (or RSsudio), load the library PopGenome, create a new folder for the gene, copy the fasta file to this folder, and read the folder with the function readData. This function creates an object of class GENOME. Then, indicate which samples are from the population of D. subobscura and of D.guanche using the function set.populatons.

 library(PopGenome)
   
 ## create a new working directory for this gene region and copy the fasta file to this folder
 dir.create("rpL32")
 file.copy(from = "rpL32.fasta",
       to="rpL32/rpL32.fasta")
 
 ## read the fasta file
 rp <- readData("rpL32")
   
 ## summary of the data
 get.sum.data(rp)
   
 ## set populations
 get.individuals(rp)
 rp <- set.populations(rp,list(
         c(get.individuals(rp)[[1]][1:18]), 
         c(get.individuals(rp)[[1]][19]))
         )
 rp@region.data@populations
 rp@region.data@populations2 
 rp<-set.outgroup(rp,c("Dgua"))

To know which indices to use in the function set.populations(), we use the function get.individuals(). Check the result of the commands region.data@populations and region.data@populations2 to make sure you have set the populations correctly.

First, let’s have a look at the variability in the region we are studying. The sliding windonw analysis is very useful for this task. To examine the distribution of the nucleotide polymorphism across the rp32L region, use the sliding.window.transform() function to create a new object (rpsw) where regions correspond to a set of windows into which you divide your region (in the rp object, regions correspond to the entire gene region). You will set a window size of 200 bp and a step size of 10 bp. For example, you can estimate and plot the number of segregating sites across the gene region:

 ## divide the region in windows
 rpsw <- sliding.window.transform(rp,width=200, jump=10,type=2,whole.data=TRUE)
 
 ## calculate segregating sites (and other statistics) 
 rpsw <- neutrality.stats(rpsw)
 
 ## calculate fixed diferences (and other statistics)
 rpsw<-calc.fixed.shared(rpsw)
 
 ## get segregating sites in D. subobscura
 s<-as.data.frame(rpsw@n.segregating.sites[,1])
 
 ## get fixed differences between D. subobscura and D. melanogaster
 d<-rpsw@n.fixed.sites
 
 ## plot the results of the sliding windows analysis
 PopGplot(s,span=0.1,xlab="position (rp32L gene region)", ylim=c(min(s,na.rm=TRUE),max(s,na.rm=TRUE)))
 lines(d)
 legend("topright", legend = c("polymorphism", "divergence"), 
  col = c("red", "black"), lty = 1)

Questions

1. Is the observed variation (visually) in agreement with expectations under the neutral theory of molecular evolution? Why?

Answer:

2. What would the plot look like if positive selection (=selective sweep) had recently acted on the 3’ half of the gene? and if balancing selection had acted on the 5’ half of the gene??

Answer:


To make sure you are on the right track, you can apply the HKA test to both halves of the rpL32 gene region. The hka() function from the sfsr package can be used to do this. However, this function only accepts the site frequency spectrum (SFS) as input for this calculation. Therefore, you must first obtain the SFS of the regions you wish to compare in the HKA test. Just to give you an example, in code bellow you can see how to obtain the SFS and perform the HKA test on the entire rpL32 gene:

 library(sfsr)

 ## extract the sample size
 n<-length(rp@populations[[1]])

 ## get the minor allele frequencies
 rp<-detail.stats(rp)
 ma<-as.data.frame(rp@region.stats@minor.allele.freqs[1]) ## IMPORTANT: see info section bellow

 ma<-ma[1,]

 ## obtain the unfolded site frequency spectrum in population 1 (including fixed differences with respect to poopulation 2)
 ma<-ma*n
 sfs<-c()
 for (i in 1:n) {
    a<-sum(ma==i)
    sfs<-c(sfs, a)
 }
 sfs
 monomorphic<-rp@n.valid.sites - sum(sfs)
 rp.sfs<-c(monomorphic)
 rp.sfs<-c(rp.sfs, sfs)
 
 
 ## plot the SFS
 cols <- c("blue", "red")[(sfs >= sfs[n]) + 1 ]
 barplot(sfs, main="Site Frequency Spectrum", names.arg=c(1:n), col=cols)
 legend("topleft", legend = c("polymorphic", "fixed"), fill=c("blue","red"))
 
 ## run the HKA test:
 rp.sfs2<-rp.sfs ## IMPORTANT: see info section bellow
 hka_rp<-hka_test(rp.sfs, rp.sfs2)
 ## p-value of the HKA test
 hka_rp$p.value
 ## observed values
 hka_rp$observed
 ## expected values
 hka_rp$expected
 ## residuals
 hka_rp$residuals

IMPORTANT: note that in the example above we have compared one SFS against itself, which doesn’t make any sense!!!. In real cases, you must to compare the SFS from two different regions, one of which should be evolving under neutrality. Also note that when you have more than one fasta file in the input folder, the object contains as many regions as fasta files in the folder. Use get.sum.data() to know the order of each region and use the correct index when extracting minor.allele.freqs.

Exercise 1

  • Using the functions and commands you already know (from this and the previous practical), create two fasta files, one with the sequences of the 5’ half (e.g. “rpL32_5.fasta”) and the other with the sequences of the 3’ half (e.g. “rpL32L_3.fasta”) of the rpL32 gene region.
    • Since the alignment is 1798 sites long, divide the region into two parts of exactly equal length (1-899 and 900-1798).
  • Apply the HKA test to the two halves of the rpL32 gene and answer the following questions:

3. Do the two halves of the rpL32 gene region evolve as expected under the neutral model?

Answer:

4. Is there any chance that the region being analysed has been subject to positive selection? Maybe even the entire region?

Answer:


3. The MK test

The MK test also compares polymorphism and divergence data but, in this case, between two types of sites within the same gene (coding region), synonymous (neutral class) and nonsynonymous sites (Figure 3). Under the null hypothesis, all nonsynonymous mutations are expected to be neutral and then the ratio of nonsynonymous to synonymous variation within species (Pn/Ps) is expected to equal the ratio of nonsynonymous to synonymous variation between species (Dn/Ds). However, these ratios will not be equal if part of the nonsynonymous variation is under either positive (i.e., mutations that increase individual fitness) or negative selection (i.e., mutations that are negatively selected but not highly deleterious).


Figure 3. The MK test.


We will use the same genomic region to see an example of how the MK test is applied to real data. The PopGenome package has a function to run this test in R. As noted above, the test is based on synonymous and non-synonymous polymorphisms and substitutions, so the first task is to extract the coding regions from the alignment of the whole genomic region (the original alignment includes non-coding regions):

 library(ape)
 
 ## Create new folder for the cds file in the working directory of this gene
 dir.create("rpL32/rpL32_cds")
 
 ## create a new alignment only with cds sequences
 aln<-read.dna(file="rpL32/rpL32.fasta", format="fasta")
 cds<-aln[,c(932:1024,1122:1430)]
 write.dna(cds, file="rpL32/rpL32_cds/rpL32_cds.fasta", format="fasta")
 
 ## read the cds sequences with PopGenome
 rpc<-readData("rpL32/rpL32_cds")
   
 ## summary of the data
 get.sum.data(rpc)
   
 ## set populations
 get.individuals(rpc)
 rpc <- set.populations(rpc,list(
         c(get.individuals(rpc)[[1]][1:18]), 
         c(get.individuals(rpc)[[1]][19]))
         )
 rpc<-set.outgroup(rpc,c("Dgua"))
 
 ## MK test - fisher test
 rpc<-MKT(rpc, do.fisher.test=TRUE)
 get.MKT(rpc)

Questions

5. What is the result of this test? Can we infer the action of positive selection from the divergence of rpL32 between these two species?

6. How many amino acid changes have occurred in the RpL32 protein since the divergence of D. subobscura and D. guanche? And how many synonymous changes? Discuss these results in relation to the evolutionary rate estimated for this protein in the final assignment of practical 4).

Answer:


Final assignment

  • Run an HKA test comparing the E9 and E10 gene regions of D. melanogaster (using D. simulans for divergence calculations)
  • Run the MK test in the coding region of the OS gene of D. melanogaster (using D. simulans for divergence calculations).
    • The coordinates of the coding region in the alignment are: 2334:2405,2468:2542,2595:2868.
  • Run the MK test in the coding region of the Acph gene of D. subobscura (using D. guanche for divergence calculations).
    • The coordinates of the coding region in the alignment are: 292:475,604:702,769:1458,1522:1780,1838:1944.

Based on the results of these test, answer the following questions:

7. Are the E9 or E10 gene regions under positive selection in D. melanogaster?

Answer:

8. If I say that the region referred to E10 is evolving under neutrality (it is a noncoding region with no functional element), does this modify your answer to question 7?

Answer:

9. Is there evidence of selection in two genes analysed by the MK test? What type of selection (positive or negative)?.

Answer:


Deliver info

Deliver this document in AULAESCI with your answers

Deadline: June 28, 2024 - 23:59

Doubts?